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Abstract. Accurate signal recovery or image reconstruction from indirect and possibly under- 
sampled data is a topic of considerable interest; for example, the literature in the recent field of 
compressed sensing is already quite immense. Inspired by recent breakthroughs in the development 
of novel first-order methods in convex optimization, most notably Nesterov's smoothing technique, 
this paper introduces a fast and accurate algorithm for solving common recovery problems in signal 
processing. In the spirit of Nesterov's work, one of the key ideas of this algorithm is a subtle averag- 
ing of sequences of iterates, which has been shown to improve the convergence properties of standard 
gradient-descent algorithms. This paper demonstrates that this approach is ideally suited for solving 
large-scale compressed sensing reconstruction problems as 1) it is computationally efficient, 2) it is 
accurate and returns solutions with several correct digits, 3) it is flexible and amenable to many kinds 
of reconstruction problems, and 4) it is robust in the sense that its excellent performance across a 
wide range of problems does not depend on the fine tuning of several parameters. Comprehensive 
numerical experiments on realistic signals exhibiting a large dynamic range show that this algorithm 
compares favorably with recently proposed state-of-the-art methods. We also apply the algorithm 
to solve other problems for which there are fewer alternatives, such as total-variation minimization, 
and convex programs seeking to minimize the £i norm of Wx under constraints, in which W is not 
diagonal. 

Key words. Nesterov's method, smooth approximations of nonsmooth functions, £i mini- 
mization, duality in convex optimization, continuation methods, compressed sensing, total-variation 
minimization. 

1. Introduction. Compressed sensing (CS) [13, 14, 25] is a novel sampling the- 
ory, which is based on the revelation that one can exploit sparsity or compressibility 
when acquiring signals of general interest. In a nutshell, compressed sensing designs 
nonadaptive sampling techniques that condense the information in a compressible 
signal into a small amount of data. There are some indications that because of 
the significant reduction in the number of measurements needed to recover a signal 
accurately, engineers are changing the way they think about signal acquisition in ar- 
eas ranging from analog-to-digital conversion [23], digital optics, magnetic resonance 
imaging [38], seismics [37] and astronomy [8]. 

In this field, a signal € M" is acquired by collecting data of the form 

b = Ax° + z, 

where x*' is the signal of interest (or its coefficient sequence in a representation where 
it is assumed to be fairly sparse), A is a known m x n "sampling" matrix, and z is a 
noise term. In compressed sensing and elsewhere, a standard approach attempts to 
reconstruct a;° by solving 

minimize f{^) \ 
subject to \\b — AxWi^ < e, ^ ' ' 

where is an estimated upper bound on the noise power. The choice of the reg- 
ularizing function / depends on prior assumptions about the signal of interest: 
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if is (approximately) sparse, an appropriate convex function is the £i norm (as 
advocated by the CS theory); if is a piecewise constant object, the total- variation 
norm provides accurate recovery results, and so on. 

Solving large-scale problems such as (1.1) (think of x'^ as having millions of entries 
as in mega-pixel images) is challenging. Although one cannot review the vast literature 
on this subject, the majority of the algorithms that have been proposed are unable to 
solve these problems accurately with low computational complexity. On the one hand, 
standard second-order methods such as interior-point methods [10, 36, 48] are accurate 
but problematic for they need to solve large systems of linear equations to compute the 
Newton steps. On the other hand, inspired by iterative thresholding ideas [24, 30, 20], 
we have now available a great number of first-order methods, see [31, 9, 34, 35] and 
the many earlier references therein, which may be faster but not necessarily accurate. 
Indeed, these methods are shown to converge slowly, and typically need a very large 
number of iterations when high accuracy is required. 

We would like to pause on the demand for high accuracy since this is the main 
motivation of the present paper. While in some applications, one may be content with 
one or two digits of accuracy, there are situations in which this is simply unacceptable. 
Imagine that the matrix A models a device giving information about the signal ^ 
such as an analog-to-digital converter, for example. Here, the ability to detect and 
recover low-power signals that are barely above the noise floor, and possibly further 
obscured by large interferers, is critical to many applications. In mathematical terms, 
one could have a superposition of high power signals corresponding to components 
x^[i\ of x'^ with magnitude of order 1, and low power signals with amplitudes as far 
as 100 dB down, corresponding to components with magnitude about 10"^. In this 
regime of high-dynamic range, very high accuracy is required. In the example above, 
one would need at least five digits of precision as otherwise, the low power signals 
would go undetected. 

Another motivation is solving (1.1) accurately when the signal a;" is not exactly 
sparse, but rather approximately sparse, as in the case of real-world compressible sig- 
nals. Since exactly sparse signals are rarely found in applications — while compressible 
signals are ubiquitous — it is important to have an accurate first-order method to han- 
dle realistic signals. 

1.1. Contributions. A few years ago, Nesterov [43] published a seminal paper 
which couples smoothing techniques (see [4] and the references therein) with an im- 
proved gradient method to derive first-order methods which achieve a convergence 
rate he had proved to be optimal [41] two decades earlier. As a consequence of this 
breakthrough, a few recent works have followed up with improved techniques for some 
very special problems in signal or image processing, see [3, 21, 52, 1] for example, or 
for minimizing composite functions such as £i-regularized least-squares problems [44]. 
In truth, these novel algorithms demonstrate great promise; they are fast, accurate 
and robust in the sense that their performance does not depend on the fine tuning of 
various controlling parameters. 

This paper also builds upon Nesterov's work by extending some of his works dis- 
cussed just above, and proposes an algorithm — or, better said, a class of algorithms — 
for solving recovery problems from incomplete measurements. We refer to this al- 
gorithm as NESTA — a shorthand for Nesterov's algorithm — to acknowledge the fact 
that it is based on his method. The main purpose and the contribution of this paper 
consist in showing that NESTA obeys the following desirable properties. 

1. Speed: NESTA is an iterative algorithm where each iteration is decomposed 
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into three steps, each involving only a few matrix- vector operations when A* A is an 
orthogonal projector and, more generally, when the eigenvalues of A* A are well clus- 
tered. This, together with the accelerated convergence rate of Nesterov's algorithm 
[43, 3], makes NESTA a method of choice for solving large-scale problems. Further- 
more, NESTA's convergence is mainly driven by a single smoothing parameter /x 
introduced in Section 2. One can use continuation techniques [34, 35] to dynamically 
update this parameter to substantially accelerate this algorithm. 

2. Accuracy: NESTA depends on a few parameters that can be set in a very 
natural fashion. In fact, there is a trivial relationship between the value of these pa- 
rameters and the desired accuracy. Furthermore, our numerical experiments demon- 
strate that NESTA can find the first 4 or 5 significant digits of the optimal solution 
to (1.1), where f{x) is the £i norm or the total-variation norm of x, in a few hundred 
iterations. This makes NESTA amenable to solve recovery problems involving signals 
of very large sizes that also exhibit a great dynamic range. 

3. Flexibility: NESTA can be adapted to solve many problems beyond £i min- 
imization with the same efliciency, such as total-variation (TV) minimization prob- 
lems. In this paper, we will also discuss applications in which / in (1.1) is given 
by f{x) = \\Wx\\i-^, where one may think of as a short-time Fourier transform 
also known as the Gabor transform, a curvelet transform, an undecimated wavelet 
transform and so on, or a combination of these, or a general arbitrary dictionary of 
waveforms (note that this class of recovery problems also include weighted £i methods 
[16]). This is particularly interesting because recent work [29] suggests the potential 
advantage of this analysis-based approach over the classical basis pursuit in solving 
important inverse problems [29]. 

A consequence of these properties is that NESTA, and more generally Nesterov's 
method, may be of interest to researchers working in the broad area of signal recovery 
from indirect and/or undersampled data. 

Another contribution of this paper is that it also features a fairly wide range 
of numerical experiments comparing various methods against problems involving re- 
alistic and challenging data. By challenging, we mean problems of very large scale 
where the unknown solution exhibits a large dynamic range; that is, problems for 
which classical second-order methods are too slow, and for which standard first-order 
methods do not provide sufficient accuracy. More specifically. Section 5 presents a 
comprehensive series of numerical experiments which illustrate the behavior of sev- 
eral state-of-the-art methods including interior point methods [36] , projected gradient 
techniques [34, 51, 31], fixed point continuation and iterative thresholding algorithms 
[34, 56, 3]. It is important to consider that most of these methods have been perfected 
after several years of research [36, 31], and did not exist two years ago. For example, 
the Fixed Point Continuation method with Active Set [35] , which represents a notable 
improvement over existing ideas, was released while we were working on this paper. 

1.2. Organization of the paper and notations. As emphasized earlier, NESTA 
is based on Nesterov's ideas and Section 2 gives a brief but essential description of 
Nesterov's algorithmic framework. The proposed algorithm is introduced in Section 3. 
Inspired by continuation-like schemes, an accelerated version of NESTA is described 
in Section 3.6. We report on extensive and comparative numerical experiments in Sec- 
tion 5. Section 6 covers extensions of NESTA to minimize the £i norm of Wx under 
data constraints (Section 6.1), and includes realistic simulations in the field of radar 
pulse detection and estimation. Section 6.3 extends NESTA to solve total-variation 
problems and presents numerical experiments which also demonstrate its remarkable 
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efficiency there as well. Finally, we conclude with Section 7 discussing further exten- 
sions, which would address an even wider range of linear inverse problems. 

Notations. Before we begin, it is best to provide a brief summary of the notations 
used throughout the paper. As usual, vectors are written in small letters and matrices 
in capital letters. The ith entry of a vector x is denoted x[i] and the (z, j)th entry of 
the matrix A is 

It is convenient to introduce some common optimization problems that will be 
discussed throughout. Solving sparse reconstruction problems can be approached via 
several different equivalent formulations. In this paper, we particularly emphasize the 
quadratically constrained ^i-minimization problem 

(BPc) minimize WxWe^ 2^ 
subject to \\b—Ax\\£2<e, 

where e quantifies the uncertainty about the measurements b as in the situation where 
the measurements are noisy. This formulation is often preferred because a reasonable 
estimate of e may be known. A second frequently discussed approach considers solving 
this problem in Lagrangian form, i.e. 

(QPJ minimize X\\x\U, + ^\\b ^ Ax\\l, (1.3) 

and is also known as the basis pursuit denoising problem (BPDN) [18]. This problem 
is popular in signal and image processing because of its loose interpretation as a 
maximum a posteriori estimate in a Bayesian setting. In statistics, the same problem 
is more well-known as the lasso [49] 

(LSt-) minimize ||5— yla;||f2 /, 

subject to lla^llfi < T. \ ■ ) 

Standard optimization theory [47] asserts that these three problems are of course 
equivalent provided that e, A,t obey some special relationships. With the exception 
of the case where the matrix A is orthogonal, this functional dependence is hard 
to compute [51]. Because it is usually more natural to determine an appropriate e 
rather than an appropriate A or t, the fact that NESTA solves (BP,:) is a significant 
advantage. Further, note that theoretical equivalence of course does not mean that all 
three problems are just as easy (or just as hard) to solve. For instance, the constrained 
problem (BP^) is harder to solve than (QP;^), as discussed in Section 5.2. Therefore, 
the fact that NESTA turns out to be competitive with algorithms that only solve 
(QPa) is quite remarkable. 

2. Nesterov's method. 

2.1. Minimizing smooth convex functions. In [42, 41], Nesterov introduces 
a subtle algorithm to minimize any smooth convex function / on the convex set Qp, 

min fix). (2.1) 

We will refer to Qp as the primal feasible set. The function / is assumed to be 
differentiable and its gradient V/(x) is Lipschitz and obeys 

\\^f{x)~Wf{y)\U,<L\\x~y\\e,; (2.2) 
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in short, L is an upper bound on the Lipschitz constant. With these assumptions, 
Nesterov's algorithm minimizes / over Qp by iteratively estimating three sequences 
{xk\, {yk\ and {zk\ while smoothing the feasible set Qp. The algorithm depends on 
two scalar sequences {ctfe} and {rfc} discussed below, and takes the following form: 



Initialize xq. For fc > 0, 

1. Compute V/(xfc). 

2. Compute yk'. 

j/fe = argmin^gQ^ ^\\x ~ Xk\\\ + (V/(xfe), x - Xfc). 

3. Compute zj.: 

Zfe = argmin^gQ^ ^PpW +EJ=o a^{'^ f(.Xi),x ~ Xi). 

4. Update Xk'- 

Xk=TkZk + (1 - Tk)yk- 

Stop when a given criterion is valid. 



At step k, yk is the current guess of the optimal solution. If we only performed 
the second step of the algorithm with yk-i instead of Xk, we would obtain a standard 
first-order technique with convergence rate 0{l/k). 

The novelty is that the sequence Zk "keeps in mind" the previous iterations since 
Step 3 involves a weighted sum of already computed gradients. Another aspect of 
this step is that — borrowing ideas from smoothing techniques in optimization [4] — it 
makes use of a prox-function Pp{x) for the primal feasible set Qp. This function is 
strongly convex with parameter ap] assuming that Pp{x) vanishes at the prox-center 
Xp = argmin^ pp (x) , this gives 

Pp{x) > yW^-^pWI- 

The prox-function is usually chosen so that Xp £ Qp, thus discouraging Zk from moving 
too far away from the center Xp. 

The point Xk, at which the gradient of / is evaluated, is a weighted average 
between zj. and yk. In truth, this is motivated by a theoretical analysis [43, 50], 
which shows that if ak = 1/2(A;-|-1) and = 2/(fc + 3), then the algorithm converges 
to 

X* = argmin/(a;) 

with the convergence rate 



(2.3) 
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This decay is far better than what is achievable via standard gradient-based optimiza- 
tion techniques since we have an approximation scahng like L/k"^ instead of L/k. 

2.2. Minimizing nonsmooth convex functions. In an innovative paper [43], 
Nesterov recently extended this framework to deal with nonsmooth convex functions. 
Assume that / can be written as 

/(x) = max(M,W^x), (2.4) 

ueQd 

where x e M", u eW and W G M^^". We will refer to Qd as the dual feasible set, and 
suppose it is convex. This assumption holds for all of the problems of interest in this 
paper — we will see in Section 3 that this holds for the total- variation 

norm and, in general, for any induced norm — yet it provides enough information 
beyond the black-box model to allow cleverly-designed methods with a convergence 
rate scaling like 0{l/k^) rather than 0{1/Vk), in the number of steps k. 

With this formulation, the minimization (2.1) can be recast as the following saddle 
point problem: 

min max {u,Wx). (2-5) 

The point is that / (2.4) is convex but generally nonsmooth. In [43], Nesterov pro- 
posed substituting / by the smooth approximation 

ff_,{x) = ina.x{u,Wx) - fipd{u), (2.6) 

ueQd 

where Pd{u) is a prox-function for Qd', that is, Pd{u) is continuous and strongly convex 
on Qd, with convexity parameter ad (we shall assume that pd vanishes at some point 
in Qd)- Nesterov proved that is continuously differentiable, and that its gradient 
obeys 

V/^(x) = W*u^{x), (2.7) 

where u^{x) is the optimal solution of (2.6). Furthermore, V/^ is shown to be Lips- 
chitz with constant 

L^ = —\\Wr (2.8) 

(IIW^II is the operator norm of W). Nesterov's algorithm can then be applied to f^{x) 
as proposed in [43]. For a fixed /i, the algorithm converges in 0{l/k^) iterations. 
If we describe convergence in terms of the number of iterations needed to reach an 
e solution (that is, the number of steps is taken to produce an x obeying \f^j,{x) — 
min I < e) , then because /i is approximately proportional to the accuracy of the 
approximation, and because is proportional to 1/fJ. ~ 1/s, the rate of convergence 
is 0{^y L^/e) « 0{l/e), a significant improvement over the sub-gradient method 
which has rate 0{l/e^). 

3. Extension to Compressed Sensing. We now extend Nesterov's algorithm 
to solve compressed sensing recovery problems, and refer to this extension as NESTA. 
For now, we shall be concerned with solving the quadratically constrained ii mini- 
mization problem (1.2). 
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3.1. NESTA. We wish to solve (1.2), i.e. minimize subject to |j6—Aa;||£2 < 
e, where A E K™^" jg singular (to < n). 

In this section, we assume that A* A is an orthogonal projector, i.e. the rows of A 
are orthonormal. This is often the case in compressed sensing applications where it 
is common to take A as a submatrix of a unitary transformation which admits a fast 
algorithm for matrix-vector products; special instances include the discrete Fourier 
transform, the discrete cosine transform, the Hadamard transform, the noiselet trans- 
form, and so on. Basically, collecting incomplete structured orthogonal measurements 
is the prime method for efficient data acquisition in compressed sensing. 

Recall that the £i norm is of the form 

\\x\\i^ = max{u,x), 

where the dual feasible set is the £oc ball 

Qd^{u: \\u\\oo <1}. 
Therefore, a natural smooth approximation to the £i norm is 

/^(x) = ma,x{u,x) - ^ipd{u), 

where Pd{u) is our dual prox-function. For pd{u), we would like a strongly convex 
function, which is known analytically and takes its minimum value (equal to zero) at 
some G Qd- It is also usual to have Pd{u) separable. Taking these criteria into 
account, a convenient choice is Pd{u) — whose strong convexity parameter ad 

is equal to 1. With this prox-function, is the well-known Huber function and V/^ 
is Lipschitz with constant l//i.^ In particular, V/^(a;) is given by 

v/.(.)H = ^";M; (3.1) 

lsgn(a;[ij), otherwise. 
Following Nesterov, we need to solve the smooth constrained problem 

mill /^(x), (3.2) 

xeQp 

where Qp — {x : ||6 — AxU^j < e}. Once the gradient of at Xk is computed. Step 2 
and Step 3 of NESTA consist in updating two auxiliary iterates, namely, and Z}.- 

3.2. Updating y^. To compute y^, we need to solve 

yk = argmin ^llxfc - x\\'j^ + {V f^,{xk),x - Xk), (3.3) 
x&Qp ^ 

where Xk is given. The Lagrangian for this problem is of course 

A) = ^\\xk - x\\l + ^ {\\b - Ax\\l - e') + {VUixk),x- Xk), (3.4) 



^In the case of total- variation minimization in which f{x) = is not a known function. 
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and at the primal-dual solution {yk,Xe), the Karush-Kuhn- Tucker (KKT) conditions 
[47] read 

\\b-Ayk\\l<e, 

K > 0, 

Xe{\\b~Ayk\\l~e')^0, 
L^{yk - Xk) + KA*{Ayk -b)+ V/^(xfe) = 0. 

From the stationarity condition, yj. is the solution to the linear system 

/ + ^A*a\ vk = -^A*b + xk^ ^s/^xk). (3.5) 
As discussed earlier, our assumption is that A* A is an orthogonal projector so that 

In this case, computing yk is cheap since no matrix inversion is required — only a few 
matrix-vector products are necessary. Moreover, from the KKT conditions, the value 
of the optimal Lagrange multiplier is obtained explicitly, and equals 

A, = max(0,e-i||6- Agllf, -L^), q = Xk - L'^V /^(xk). (3.7) 

Observe that this can be computed beforehand since it only depends on a;^ and 
V^(a;,). 

3.3. Updating Zk- To compute Zk, we need to solve 

Zk = argmin ^pp{x) + (Y] aiV/^(xi), x - Xk), (3.8) 



i<k 



where Pp{x) is the primal prox-function. The point Zk differs from yk since it is 
computed from a weighted cumulative gradient X)i<fc '^i^ f^l{xi), making it less prone 
to zig-zagging, which typically occurs when we have highly elliptical level sets. This 
step keeps a memory from the previous steps and forces Zk to stay near the prox- 
center. 

A good primal prox-function is a smooth and strongly convex function that is 
likely to have some positive effect near the solution. In the setting of (1.1), a suitable 
smoothing prox-function may be 

Pp{x) = ]^\\x - xo\W (3.9) 

for some xq G M", e.g. an initial guess of the solution. Other choices of primal 
feasible set Qp may lead to other choices of prox-functions. For instance, when Qp 
is the standard simplex, choosing an entropy distance for Pp{x) is smarter and more 
efficient, see [43]. In this paper, the primal feasible set is quadratic, which makes the 
Euclidean distance a reasonable choice. What is more important, however, is that 
this choice allows very efficient computations of yk and Zk while other choices may 
considerably slow down each Nesterov iteration. Finally, notice that the bound on the 
error at iteration k in (2.3) is proportional to pp(x*); choosing xq wisely (a good first 
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guess) can make pp{x*) small. When nothing is known about the solution, a natural 
choice may be Xq = A*b; this idea will be developed in Section 3.6. 

With (3.9), the strong convexity parameter oi Pp{x) is equal to 1, and to compute 
Zk we need to solve 

Zk = argmin^llx-xoll^^ + ^\\b - Ax\\l^ + a,V/^(xi), x - Xfe) (3.10) 
for some value of A. Just as before, the solution is given by 

Zk =(l- xTr^*A f + xo-^Y. 1 > (3-11) 

with a value of the Lagrange multiplier equal to 

A, = max(0,e-i||6-Ag||,, -L^), q = - ^ VaJ^(a;,). (3.12) 

i<k 

In practice, the instances {V/^(a;i)}i<fe have not to be stored; one just has to store 
the cumulative gradient J2i<k '^i'^ ft^i^i)- 

3.4. Computational complexity. The computational complexity of each of 
NESTA's step is clear. In large-scale problems, most of the work is in the application 
of A and A*. Put Ca for the complexity of applying A or A* . The first step, namely, 
computing V/^, only requires vector operations whose complexity is 0{n). Step 2 
and 3 require the application of A or A* three times each (we only need to compute 
A*b once). Hence, the total complexity of a single NESTA iteration is 6Ca + 0{n) 
where Ca is dominant. 

The calculation above are in some sense overly pessimistic. In compressed sensing 
applications, it is common to choose A as a submatrix of a unitary transformation 
U, which admits a fast algorithm for matrix- vector products. In the sequel, it might 
be useful to think of A as a subsampled DPT. In this case, letting R be the m x n 
matrix extracting the observed measurements, we have A — RU. The trick then is 
to compute in the [/-domain directly. Making the change of variables x <— C/x, our 
problem is 

minimize fp-i^) 

subject to ||6— Rx\\i2 < e, 

where — f^°U* . The gradient of is then 

Vf^{x)^UVf^{U*x). 

With this change of variables. Steps 2 and 3 do not require applying U or U* since 

yk= (^-XT^^*^) + 

where R* R is the diagonal matrix with 0/1 diagonal entries depending on whether 
a coordinate is sampled or not. As before, A^ = max(0, \\b — RqWi^ ~ L^) with q = 
Xk — L^^V/p(xfc). The complexity of Step 2 is now 0{n) and the same applies to 
Step 3. 
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Put Cu for the complexity of applying U and U* . The complexity of Step 1 is now 
2C{j, so that this simple change of variables reduces the cost of each NESTA iteration 
to 2Cu + 0{n). For example, in the case of a subsampled DFT (or something similar), 
the cost of each iteration is essentially that of two FFTs. Hence, each iteration is 
extremely fast. 

3.5. Parameter selection. NESTA involves the selection of a single smoothing 
parameter /i and of a suitable stopping criterion. For the latter, our experience indi- 
cates that a robust and fairly natural stopping criterion is to terminate the algorithm 
when the relative variation of is small. Define A/^ as 



for some 5 > 0. In our experiments, S E {10^^, 10~^, 10~^, 10~®} depending upon the 
desired accuracy. 

The choice of ^ is based on a trade-off between the accuracy of the smoothed 
approximation (basically, lim^^o //i(2;) — \\x\\ei) and the speed of convergence 
(the convergence rate is proportional to /i). With noiseless data, /i is directly linked 
to the desired accuracy. To illustrate this, we have observed in [7] that when the true 
signal x'^ is exactly sparse and is actually the minimum solution under the equality 
constraints Ax'^ = 6, the loo error on the nonzero entries is on the order of [i. The 
link between and accuracy will be further discussed in Section 4.3. 

3.6. Accelerating NESTA with continuation. Inspired by homotopy tech- 
niques which find the solution to the lasso problem (1.4) for values of r ranging in 
an interval [0,rmax], [34] introduces a fixed point continuation technique which solves 
£i-penalized least-square problems (1.3) 



for values of A obeying < A < The continuation solution approximately 

follows the path of solutions to the problem (QP^) and, hence, the solutions to (1.1) 
and (1.4) may be found by solving a sequence a i\ penalized least-squares problems. 

The point of this is that it has been noticed (see [34, 45, 27]) that solving (1.3) 
(resp. the lasso (1.4)) is faster when A is large (resp. r is low). This observation 
greatly motivates the use of continuation for solving (1.3) for a fixed A/. The idea 
is simple: propose a sequence of problems with decreasing values of the parameter 
A, Ao > ■ • • > A/, and use the intermediate solution as a warm start for the next 
problem. This technique has been used with some success in [31, 51]. Continuation 
has been shown to be a very successful tool to increase the speed of convergence, in 
particular when dealing with large-scale problems and high dynamic range signals. 

Likewise, our proposed algorithm can greatly benefit from a continuation ap- 
proach. Recall that to compute j/fc, we need to solve 




(3.13) 



Then convergence is claimed when 



(QPa) minimize A||a;||£j H — 1|6 — Axj 



yk = argmin ^llx - Xk\\\ + {c,x 



= argminjla; - {xk - L^^c)\\j^ 
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for some vector c. Thus with Vq^ the projector onto Qp, yt — 'PQ^{xk — L^^c). Now 
two observations are in order. 

1. Computing yk is similar to a projected gradient step as the Lipschitz constant 
plays the role of the step size. Since is proportional to the larger /i, the 

larger the step-size, and the faster the convergence. This also applies to the sequence 
{zk}- 

2. For a fixed value of /x, the convergence rate of the algorithm obeys 

2LJ\x* -xqP, 

U{yk)-Ux;)< 

where x* is the optimal solution to min over Qp. On the one hand, the convergence 
rate is proportional to /i^^, so a large value of fi is beneficial. On the other hand, 
choosing a good guess xq close to a;* provides a low value of Pp(a;*) — ^Ija;* — a:^ol!^2, 
also improving the rate of convergence. Warm-starting with xq from a previous solve 
not only changes the starting point of the algorithm, but it beneficially changes Pp as 
well. 

These two observations motivate the following continuation-like algorithm: 



Initialize /iq, xq and the number of continuation steps T. For < > 1, 

1. Apply Nesterov's algorithm with fi — and Xq — x^(t-i). 

2. Decrease the value of /i: Tm''*'' with 7 < 1. 
Stop when the desired value of /iy is reached. 



This algorithm iteratively finds the solutions to a succession of problems with de- 
creasing smoothing parameters fiQ > ■ ■ ■ > fj,f — 7"^/io producing a sequence of — 
hopefully — finer estimates of x*^ ; these intermediate solutions are cheap to compute 
and provide a string of convenient first guess for the next problem. In practice, they 
are solved with less accuracy, making them even cheaper to compute. 

The value of /i/ is based on a desired accuracy as explained in Section 3.5. As 
for an initial value fiQ, (3.1) makes clear that the smoothing parameter plays a role 
similar to a threshold. A first choice may then be /zq — 0.9\\A*b\\i^. 

We illustrate the good behavior of the continuation-inspired algorithm by apply- 
ing NESTA with continuation to solve a sparse reconstruction problem from partial 
frequency data. In this series of experiments, we assess the performance of NESTA 
while the dynamic range of the signals to be recovered increases. 

The signals x are s-sparse signals — that is, have exactly s nonzero components — 
of size n = 4096 and s = m/40. Put A for the indices of the nonzero entries of x; the 
amplitude of each nonzero entry is distributed uniformly on a logarithmic scale with 
a fixed dynamic range. Specifically, each nonzero entry is generated as follows: 

= 7yi[i]10"''^W, (3.14) 

where r]i[i\ = ±1 with probability 1/2 (a random sign) and 772 [i] is uniformly dis- 
tributed in [0,1]. The parameter a quantifies the dynamic range. Unless specified 
otherwise, a dynamic range of d dB means that a = d/20 (since for large signals a 
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Figure 3.1. Value o//py.(xfe) as a function of iteration k. Solid line: without continuation. 
Dashed line: with continuation. Here, the test signal has 100 dB of dynamic range. 

is approximately the logarithm base 10 of the ratio between the largest and the low- 
est magnitudes). For instance, 80 dB signals are generated according to (3.14) with 
a = 4. 

The measurements Ax consist of m = n/8 random discrete cosine measurements 
so that A* A is diagonalized by the DCT. Finally, h is obtained by adding a white 
Gaussian noise term with standard deviation a — 0.1. The initial value of the smooth- 
ing parameter is /iq = ||j4*6||£^ and the terminal value is /iy = 2a. The algorithm 
terminates when the relative variation of is lower than 6 = 10~^. NESTA with 
continuation is applied to 10 random trials for varying number of continuation steps 
T and various values of the dynamic range. Figure 3.1 graphs the value of /p^. while 
applying NESTA with and without continuation as a function of the iteration count. 
The number of continuation steps is set to T = 4. 

One can observe that computing the solution to min /^^ (solid line) takes a while 
when computed with the final value /ij; notice that NESTA seems to be slow at the 
beginning (number of iterations lower than 15). In the meantime NESTA with con- 
tinuation rapidly estimates a sequence of coarse intermediate solutions that converges 
to the solution to min/^^ In this case, continuation clearly enhances the global speed 
of convergence with a factor 10. Figure 3.2 provides deeper insights into the behavior 
of continuation with NESTA and shows the number of iterations required to reach 
convergence for varying values of the continuation steps T for different values of the 
dynamic range. 

When the ratio fJ-o/fJ-f is low or when the required accuracy is low, continuation 
is not as beneficial: intermediate continuation steps require a number of iterations 
which may not speed up overall convergence. The stepsize which is about Lj^j works 
well in this regime. When the dynamic range increases and we require more accu- 
racy, however, the ratio /io/M/ is large, since /io = •9|l^*^ll^=o ~ ^ '^^ ^'^d 
continuation provides considerable improvements. In this case, the step size L~j is 
too conservative and it takes a while to find the large entries of x. Empirically, when 
the dynamic range is 100 dB, continuation improves the speed of convergence by a 
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Figure 3.2. Total number of iterations required for convergence with a varying number of 
continuation steps and for different values of the dynamic range. 



factor of 8. As this factor is likely to increase exponentially with the dynamic range 
(when expressed in dB), NESTA with continuation seems to be a better candidate for 
solving sparse reconstruction problems with high accuracy. 

hiterestingly, the behavior of NESTA with continuation seems to be quite stable: 
increasing the number of continuation steps does not increase dramatically the number 
of iterations. In practice, although the ideal T is certainly signal dependent, we have 
observed that choosing T G {4, 5, 6} leads to reasonable results. 

3.7. Some theoretical considerations. The convergence of NESTA with and 
without continuation is straightforward. The following theorem states that each con- 
tinuation step with /i = /i*^*-* converges to a:* . Global convergence is proved by applying 
this theorem to t = T. 

Theorem 3.1. At each continuation step t, lim^^^^oo J/fe — ^*^{t), OLi^d 

f I . r . . ^ ^ 2L^(.)||a;*(,) - ||^, 
ft,(n(yk) - /p(o(a;^(t)) < ^ . 



Proof. Immediate by using [43, Theorem 2]. □ 

As mentioned earlier, continuation may be valuable for improving the speed of 
convergence. Let each continuation step t stop after A/"^*-* iterations with 

so that we have 
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Figure 3.3. Typical solution paths - Left: smooth solution path. Right: nonsmooth solution path. 



where the accuracy 7*i5o becomes tighter as t increases. Then summing up the con- 
tribution of all the continuation steps gives 



When NESTA is apphed without continuation, the number of iterations required 
to reach convergence is 



Now the ratio Nc/M is given by 

Continuation is definitely worthwhile when the right-hand side is smaller than 1. 
Interestingly, this quantity is directly linked to the path followed by the sequence 
Xq — + —>•••—> x^j. . More precisely, it is related to the smoothness of this path; 
for instance, if all the intermediate points x^a) belong to the segment [a;o,a;^^] in 
an ordered fashion, then ll^^ct) ~ x^iit-'^)\\i2 — 11^^/ ~ 2;o||f2- Hence, ^ < 1 and 
continuation improves the convergence rate. 

Figure 3.3 illustrates two typical solution paths with continuation. When the 
sequence of solutions obeys ||xo||£i > . . • ||a;*(t)||^i ■ • ■ > ll^^^/lki (this is the case when 
Xq = A*b and /^i > • ■ ■ m'-*'' • • • > ^J^f), the solution path is likely to be "smooth;" that 
is, the solutions obey ||a;*(t) — a;*^. > ||a;*(t+i) — x*^!!^^ as on the left of Figure 3.3. 
The "nonsmooth" case on the right of Figure 3.3 arises when the sequence of smooth- 
ing parameters docs not provide estimates of x*^. that are all better than xo. Here, 
computing some of the intermediate points {x*^^t)} is wasteful and continuation fails 
to be faster. 
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Table 4.1 

Assessing FISTA's and NESTA's accuracy when the optimal solution is known. The relative 
error on the optimal value is given by (4.1) and the loo error on the optimal solution by (4.2). Ma 
is the number of calls to A or A* to compute the solution. 



Method 


^i-nomi 


Rel. error i?i-norm 


ioo error 


Ma 


X* 


3.33601e+6 








FISTA 


3.33610e+6 


2.7e-5 


0.31 


40000 


NESTA 11 = 0.02 


3.33647e+6 


1.4e-4 


0.08 


513 



4. Accurate Optimization. A significant fraction of the numerical part of this 
paper focuses on comparing different sparse recovery algorithms in terms of speed 
and accuracy. In this section, we first demonstrate that NESTA can easily recover 
the exact solution to (BP^) with a precision of 5 to 6 digits. Speaking of precision, 
we shall essentially use two criteria to evaluate accuracy. 

1. The first is the (relative) error on the objective functional 

where x* is the optimal solution to (BP^) . 

2. The second is the accuracy of the optimal solution itself and is measured via 



\\X~X*\U^: (4.2) 

which gives a precise value of the accuracy per entry. 

4.1. Is NESTA accurate? For general problem instances, the exact solution 
to (BPj) (or equivalently (QP;^)) cannot be computed analytically. Under some con- 
ditions, however, a simple formula is available when the optimal solution has exactly 
the same support and the same sign as the unknown (sparse) a;" (recall the model 
b — Ax'^ + z). Denote by / the support of a;°, I :— {i : > 0}. Then if is 

sufficiently sparse and if the nonzero entries of x'^ are sufficiently large, the solution 
X* to (QPa) is given by 

x*[I] = iA[irA[I])-\A[I]*b- Xsgnix'^m, (4.3) 
x'-il"] = 0, (4.4) 

see [12] for example. In this expression, x[I] is the vector with indices in / and A[I] 
is the submatrix with columns indices in /. 

To evaluate NESTA's accuracy, we set n = 262,144, m = n/S, and s ~ m/100 
(this is the number of nonzero coordinates of xq)- The absolute values of the nonzero 
entries of xq are distributed between 1 and 10^ so that we have about 100 dB of 
dynamic range. The measurements Ax*^ are discrete cosine coefficients selected uni- 
formly at random. We add Gaussian white noise with standard deviation a = 0.01. 
We then compute the solution (4.3), and make sure it obeys the KKT optimality 
conditions for (QP^) so that the optimal solution is known. 

We run NESTA with continuation with the value of e := ||6 — Aa;*||. We use 
/i = 0.02, 5 = 10^^ and the number of continuation steps is set to 5. Table 4.1 reports 
on numerical results. First, the value of the objective functional is accurate up to 4 
digits. Second, the computed solution is very accurate since we observe an £00 error 
of 0.08. Now recall that the nonzero components of x* vary from about 1 to 10^ so 
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Entries of x 



Figure 4.1. Entries of the computed solutions versus the optimal solution. The absolute values 
of the entries on the support of the optimal solution are plotted. 

that we have high accuracy over a huge dynamic range. This can also be gleaned from 
Figure 4.1 which plots NESTA's solution versus the optimal solution, and confirms 
the excellent precision of our algorithm. 

4.2. Setting up a reference algorithm for accuracy tests. In general situ- 
ations, a formula for the optimal solution is of course unavailable, and evaluating the 
accuracy of solutions requires defining a method of reference. In this paper, we will 
use FISTA [3] as such a reference since it is an efficient algorithm that also turns out 
to be extremely easy to use; in particular, no parameter has to be tweaked, except 
for the standard stopping criterion (maximum number of iterations and tolerance on 
the relative variation of the objective function). 

We run FISTA with 20,000 iterations on the same problem as above, and report 
its accuracy in Table 4.1. The ^i-norm is exact up to 4 digits. Furthermore, Figure 4.1 
shows the entries of FISTA's solution versus those of the optimal solution, and one 
observes a very good fit (near perfect when the magnitude of a component of x* is 
higher than 3). The £oo error between FISTA's solution and the optimal solution x* 
is equal to 0.31; that is, the entries are exact up to ±0.31. Because this occurs over an 
enormous dynamic range, we conclude that FISTA also gives very accurate solutions 
provided that sufficiently many iterations are taken. We have observed that running 
FISTA with a high number of iterations — typically greater than 20,000 — provides 
accurate solutions to (QP_>^), and this is why we will use it as our method of reference 
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Table 4.2 

NESTA's accuracy. The errors and number of function calls Ma have the same meaning as in 
Table 4.1. 



Method 


£i-norm 


Rcl. error ^i-norm 


ioo error 




FISTA 


5.71539C+7 








NESTA 


= 


0.2 


5.71614C+7 


1.3e-4 


3.8 


659 


NESTA 


= 


0.02 


5.71547e+7 


1.4e-5 


0.96 


1055 


NESTA 


M = 


0.002 


5.71540e+7 


1.6e-6 


0.64 


1537 



in the forthcoming comparisons from this section and the next. 

4.3. The smoothing parameter /i and NESTA's accuracy. By definition, 
fj, fixes the accuracy of the approximation to the £i norm and, therefore, NESTA's 
accuracy directly depends on this parameter. We now propose to assess the accuracy 
of NESTA for different values of /i. The problem sizes are as before, namely, n = 
262,144 and m = n/8, except that now the unknown a;° is far less sparse with s = to/5. 
The standard deviation of the additive Gaussian white noise is also higher, and we 
set a = 0.1. 

Because of the larger value of s and a, it is no longer possible to have an analytic 
solution from (4.3). Instead, we use FISTA to compute a reference solution xp, using 
20,000 iterations and with A = 0.0685, which gives ||6 - Axf\\% ~ (to + 2V2rri)a^. 
To be sure that FISTA's solution is very close to the optimal solution, we check that 
the KKT stationarity condition is nearly verified. If li, is the support of the optimal 
solution X*, this condition reads 

A[h]*{b ^ Ax*) ^ Xsgn{x*[h]), 
\\A[Il]*{b-Ax*)\U^<X. 

Now define / to be the support of xp- Then, here, xp obeys 

\\A[I]*{b~ Axp) - Xsgii{x f[I])\\i^ = 2.6610~i°A, 
\\A[r]*{b^ AxF)\\t^ < 0.99A. 

This shows that xp is extremely close to the optimal solution. 

NESTA is run with T = 5 continuation steps for three different values of /i €E 
{0.2,0.02,0.002} (the tolerance S is set to 10^^, 10^"^ and 10"® respectively). Fig- 
ure 4.2 plots the solutions given by NESTA versus the "optimal solution" xp- Clearly, 
when /i decreases, the accuracy of NESTA increases just as expected. More precisely, 
notice in Table 4.2 that for this particular experiment, decreasing /i by a factor of 10 
gives about 1 additional digit of accuracy on the optimal value. 

According to this table, fj, = 0.02 seems a reasonable choice to guarantee an 
accurate solution since one has between 4 and 5 digits of accuracy on the optimal 
value, and since the €oo error is lower than 1. Observe that this value separates the 
nonzero entries from the noise floor (when a = 0.01). In the extensive numerical 
experiments of Section 5, we shall set /i = 0.02 and S = 10"^ as default values. 

5. Numerical comparisons. This section presents numerical experiments com- 
paring several state-of-the-art optimization techniques designed to solve (1.2) or (1.3). 
To be as fair as possible, we propose comparisons with methods for which software is 
publicly available online. To the best of our knowledge, such extensive comparisons 
are currently unavailable. Moreover, whereas publications sometimes test algorithms 
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Figure 4.2. Entries of the computed solutions versus the optimal solution. We plot the absolute 
values of the entries on the set where the magnitude of the optimal solution exceeds 1. 



on relatively easy and academic problems, we will subject optimization methods to 
hard but realistic (.i reconstruction problems. 

In our view, a challenging problem involves some or all of the characteristics 
below. 

1. High dynamic range. As mentioned earlier, most optimization techniques 
are able to find (more or less rapidly) the most significant entries (those with a large 
amplitude) of the signal x. Recovering the entries of x that have low magnitudes 
accurately is more challenging. 

2. Approximate sparsity. Realistic signals are seldom exactly sparse and, there- 
fore, coping with approximately sparse signals is of paramount importance. In signal 
or image processing for example, wavelet coefficients of natural images contain lots of 
low level entries that are worth retrieving. 

3. Large scale. Some standard optimization techniques, such as interior point 
methods, are known to provide accurate solutions. However, these techniques are 
not applicable to large-scale problems due to the large cost of solving linear systems. 
Further, many existing software packages fail to take advantage of fast-algorithms for 
applying A. We will focus on large-scale problems in which the number of unknowns 
n is over a quarter of a million, i.e. n — 262,144. 

5.1. State-of-the-art methods. Most of the algorithms discussed in this sec- 
tion are considered to be state-of-art in the sense that they are the most competitive 
among sparse reconstruction algorithms. To repeat ourselves, many of these methods 
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have been improved after several years of research [36, 31], and many did not exist 
two years ago [34, 51]. For instance, [35] was submitted for publication less than 
three months before we put the final touches on this paper. Finally, our focus is on 
rapid algorithms so that we are interested in methods which can take advantage of 
fast algorithms for applying A to a vector. This is why we have not tested other good 
methods such as [32], for example. 

5.1.1. NESTA. Below, we apphed NESTA with the following default parame- 
ters 

xo = A*b, ^ = 0.02, 5^10"^ 

(recall that xq is the initial guess). The maximal number of iterations is set to 
2^max — 10,000; if convergence is not reached after Imax iterations, we record that the 
algorithm did not convergence (DNC). Because NESTA requires 2 calls to either A 
or A* per iteration, this is equivalent to declaring DNC after A/^ = 20,000 iterations 
where Ma refers to the total number of calls to A or A*; hence, for the other methods, 
we declare DNC when A/a > 20,000. When continuation is used, extra parameters 
are set up as follows: 

T = 4, fiQ^WxoWi^, 7 (^//io)^/^, 

and for t — 1, . . . , T, 

f^t = 7V0, St = 0.1 • (J/0.1)*/^. 
Numerical results are reported and discussed in Section 5.4. 

5.1.2. Gradient Projections for Sparse Reconstruction (GPSR) [31]. 

GPSR has been introduced in [31] to solve the standard ii minimization problem in 
Lagrangian form (QP;^). GPSR is based on the well-known projected gradient step 
technique, 

yik+i) = (^^(k-i) _ akVF{v,)) , 

for some projector Vq onto a convex set Q; this set contains the variable of interest v. 
In this equation, F is the function to be minimized. In GPSR, the problem is recast 
such that the variable v — [vi,V2] has positive entries and x — vi — V2 (a standard 
change of variables in linear programming methods). The function F is then 

F{v) = XVv+^-\\b-[A,-A]v\\l, 

where 1 is the vector of ones, and v belongs to the nonnegative orthant, v[i] > for 
all i. The projection onto Q is then trivial. Different techniques for choosing the step- 
size ak (backtracking, Barzilai-Borwein [2], and so on) are discussed in [31]. The code 
is available at http://www.lx.it.pt/~mtf/GPSR/. In the forthcoming experiments, 
the parameters are set to their default values. 

GPSR also implements continuation, and we test this version as well. All pa- 
rameters were set to defaults except, per the recommendation of one of the GPSR 
authors to increase performance, the number of continuation steps was set to 40, the 
ToleranceA variable was set to 10^"^, and the MiniterA variable was set to 1. In 
addition, the code itself was tweaked a bit; in particular, the stopping criteria for 
continuation steps (other than the final step) was changed. Future releases of GPSR 
will probably contain a similarly updated continuation stopping criteria. 
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5.1.3. Sparse reconstruction by separable approximation (SpaRSA) [54]. 

SpaRSA is an algorithm to minimize composite functions (t>{x) — f{x) + \c(x) com- 
posed of a smooth term / and a separable non-smooth term c, e.g. (QP_;^). At every 
step, a subproblem of the form 

minimize lla; — yllfl H cix) 

a 

with optimization variable x must be solved; this is the same as computing the prox- 
imity operator corresponding to c. For (QP_)^), the solution is given by shrinkage. In 
this sense, SpaRSA is an iterative shrinkage/ thresholding (1ST) algorithm, much like 
FISTA (though without the accelerated convergence) and FPC. Also like FPC, contin- 
uation is used to speed convergence, and like FPC-BB, a Barzilai-Borwein heuristic 
is used for the step size a (instead of using a pessimistic bound like the Lipschitz 
constant). With this choice, SpaRSA is not guaranteed to be monotone, which can 
be remedied by implementing an appropriate safeguard, although this is not done in 
practice because there is little experimental advantage to doing so. Code for SpaRSA 
may be obtained at http : //www . Ix . it .pt/~mtf /SpaRSA/. Parameters were set to 
default except the number of continuation steps was set to 40 and the MiniterA vari- 
able was set to 1 (instead of the default 5) , as per the recommendations of one of the 
SpaRSA authors — again, as to increase performance. 

5.1.4. li regularized least squares (ll_ls) [36]. This method solves the stan- 
dard unconstrained minimization problem, and is an interior point method (with 
log-barrier) using preconditioned conjugate gradient (PCG) to accelerate convergence 
and stabilize the algorithm. The preconditioner used in the PCG step is a linear 
combination of the diagonal approximation of the Hessian of the quadratic term and 
of the Hessian of the log-barrier term. llJs is shown to be faster than usual interior 
point methods; nevertheless, each step requires solving a linear system of the form 
HAx — g. Even if PCG makes the method more reliable, ll_ls is still problematic for 
large-scale problems. In the next comparisons, we provide some typical values of its 
computational complexity compared to the other methods. The code is available at 
http : //www. Stanford. edu/^boyd/ll_ls/. 

5.1.5. Spectral projected gradient (SPGLl) [51]. In 2008, van den Berg 
et al. adapted the spectral projection gradient algorithm introduced in [6] to solve 
the LASSO (LS,-). Interestingly, they introduced a clever root finding procedure 
such that solving a few instances of (LSr) for different values of r enables them 
to equivalently solve (BP,:). Furthermore, if the algorithm detects a nearly-sparse 
solution, it defines an active set and solves an equation like (4.3) on this active set. 
In the next experiments, the parameters are set to their default values. The code is 
available at http://www.cs.ubc.ca/labs/scl/SPGLll/. 

5.1.6. Fixed Point Continuation method (FPC) [34, 35]. The Fixed Point 
Continuation method is a recent first-order algorithm for solving (QP;^) and simple 
generalizations of (QP^) • The main idea is based on a fixed point equation, x — F{x), 
which holds at the solution (derived from the subgradient optimality condition). For 
appropriate parameters, F is a, contraction, and thus the algorithm Xk+i = Fixk) 
converges. The operator F comes from forward-backward splitting, and consists of 
a soft-thresholding/shrinkage step and a gradient step. The main computational 
burden is one application of A and A* at every step. The papers [34, 35] prove 
g-linear convergence, and finite-convergence of some of the components of x for s- 
sparse signals. The parameter A in (QP;^) determines the amount of shrinkage and. 
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therefore, the speed of convergence; thus in practice, A is decreased in a continuation 
scheme. Code for FPC is available at http://www.caam.rice.edu/~optimization/ 
Ll/fpc/. Also available is a state-of-the-art version of FPC from 2008 that uses 
Barzilai-Borwein [2] steps to accelerate performance. In the numerical tests, the 
Barzilai-Borwein version (referred to as FPC-BB) significantly outperforms standard 
FPC. All parameters were set to default values. 

5.1.7. FPC Active Set (FPC-AS) [53]. In 2009, inspired by both first-order 
algorithms, such as FPC, and greedy algorithms [28, 40], Wen et al. [53] extend 
FPC into the two-part algorithm FPC Active Set to solve (QP_)^). In the first stage, 
FPC-AS calls an improved version of FPC that allows the step-size to be updated 
dynamically, using a non-monotone exact line search to ensure r-linear convergence, 
and a Barzilai-Borwein [2] heuristic. After a given stopping criterion, the current 
value, Xk, is hard-thresholded to determine an active set. On the active set, \\x\\e^ 
is replaced by c*x, where c = sgn(xfc), with the constraints that x[i] ■ c[i] > for all 
the indices i belonging to the active set. The objective is now smooth, and solvers, 
like conjugate gradients (CG) or quasi-Newton methods (e.g. L-BFGS or L-BFGS- 
B), can solve for x on the active set; this the same as solving (4.3). This two- 
step process is then repeated for a smaller value of A in a continuation scheme. We 
tested FPC-AS using both L-BFGS (the default) and CG (which we refer to as FPC- 
AS-CG) to solve the subproblem; both of these solvers do not actually enforce the 
x[i] ■ c[i] > constraint on the active set. Code for FPC-AS is available at http: 
//www. caam. rice . edu/~optimization/Ll/FPC_AS/. 

For s-sparse signals, all parameters were set to defaults except for the stopping 
criteria (as discussed in Section 5.3). For approximately sparse signals, FPC-AS 
performed poorly (> 10, 000 iterations) with the default parameters. By changing 
a parameter that controls the estimated number of nonzeros from to/2 (default) to 
n, the performance improved dramatically, and this is the performance reported in 
the tables. The maximum number of subspace iterations was also changed from the 
default to 10, as recommended in the help file. 

5.1.8. Bregman. The Bregman Iterative algorithm, motivated by the Bregman 
distance, has been shown to be surprisingly simple [56]. The first iteration solves 
(QP;^) for a specified value of A; subsequent iterations solve (QP^) for the same value 
of A, with an updated observation vector b. Typically, only a few outer iterations are 
needed (e.g. 4), but each iteration requires a solve of (QP;^), which is costly. The 
original Bregman algorithm calls FPC to solve these subproblems; we test Bregman 
using FPC and the Barzilai-Borwein version of FPC as subproblem solvers. 

A version of the Bregman algorithm, known as the Linearized Bregman algo- 
rithm [46, 9], takes only one step of the inner iteration per outer iteration; con- 
sequently, many outer iterations are taken, in contrast to the regular Bregman al- 
gorithm. It can be shown that linearized Bregman is equivalent to gradient as- 
cent on the dual problem. Linearized Bregman was not included in the tests be- 
cause no standardized public code is available. Code for the regular Bregman algo- 
rithm may be obtained at http://www.caam.rice.edu/~optimization/Ll/2006/ 
10/bregman-iterative-algorithms-f or .html. There are quite a few parameters, 
since there are parameters for the outer iterations and for the inner (FPC) iterations; 
for all experiments, parameters were set to defaults. In particular, we noted that 
using the default stopping criteria for the inner solve, which limited FPC to 1,000 
iterations, led to significantly better results than allowing the subproblem to run to 
10,000 iterations. 
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5.1.9. Fast Iterative Soft-Thresholding Algorithm (FISTA). FISTA is 
based upon Nesterov's work but departs from NESTA in two important ways: 1) 
FISTA solves the sparse unconstrained reconstruction problem (QP_;^); 2) FISTA is a 
proximal subgradient algorithm, which only uses two sequences of iterates. In some 
sense, FISTA is a simplified version of the algorithm previously introduced by Nesterov 
to minimize composite functions [44] . The theoretical rate of convergence of FISTA 
is similar to NESTA 's, and has been shown to decay as ©(l/fc^). 

For each test, FISTA is run twice: it is first run until the relative variation in the 
function value is less than 10^^*, with no limit on function calls, and this solution is 
used as the reference solution. It is then run a second time using either Criteria 1 or 
Criteria 2 as the stopping condition, and these are the results reported in the tables. 

5.2. Constrained versus unconstrained minimization. We would like to 
briefiy highlight the fact that these algorithms are not solving the same problem. 
NESTA and SPGLl solve the constrained problem (BPg), while all other methods 
tested solve the unconstrained problem (QP_i^). As the first chapter of any optimiza- 
tion book will emphasize, solving an unconstrained problem is in general much easier 
than a constrained problem.^ For example, it may be hard to even find a feasible 
point for (BPg), since the pseudo-inverse of A, when A is not a projection, may be 
difficult to compute. It is possible to solve a sequence of unconstrained (QP^j) prob- 
lems for various Xj to approximately find a value of the dual variable A that leads to 
equivalence with (BPc), but even if this procedure is integrated with the continua- 
tion procedure, it will require several, if not dozens, of solves of (QP^) (and this will 
in general only lead to an approximate solution to (BPf)). The Newton-based root 
finding method of SPGLl relies on solving a sequence of constrained problems (LSt-); 
basically, the dual solution to a constrained problem gives useful information. 

Thus, we emphasize that SPGLl and NESTA are actually more general than the 
other algorithms (and as Section 6 shows, NESTA is even more general because it 
handles a wide variety of constrained problems); this is especially important because 
from a practical viewpoint, it may be easier to estimate an appropriate e than an 
appropriate value of A. Furthermore, as will be shown in Section 5.4, SPGLl and 
NESTA with continuation are also the most robust methods for arbitrary signals 
(i.e. they perform well even when the signal is not exactly sparse, and even when it 
has high dynamic range). Combining these two facts, we feel that these two algorithms 
are extremely useful for real-world applications. 

5.3. Experimental protocol. In these experiments, we compare NESTA with 
other efficient methods. There are two main difficulties with comparisons which might 
explain why broad comparisons have not been offered before. The first problem is 
that some algorithms, such as NESTA, solve (BP^), whereas other algorithms solve 
(QPa)- Given e, it is difficult to compute A(e) that gives an equivalence between the 
problems; in theory, the KKT conditions give A, but we have observed in practice that 
because we have an approximate solution (albeit a very accurate one), computing A 
in this fashion is not stable. 

Instead, we note that given A and a solution x\ to (QP_)^), it is easy to compute a 
very accurate e(A) since e = ll^a;^ — bWe^- Hence, we use a two-step procedure. In the 

first step, we choose a value of eo — \/m + 2\/2mcr based on the noise level a (since a 

^The constrained problem (BPe) is equivalent to that of minimizing ||a;||^j +XQp(^) where Qp 
is the feasible set {x : \\Ax — b\\i^ < e}, and XQp(^) = if x £ Qp and +oo otherwise. Hence, the 
unconstrained problem has a discontinuous objective functional. 
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value of A that corresponds to a is less clear), and use SPGLl to solve (BP^). From 
the SPGLl dual solution, we have an estimate of A = A(eo). As noted above, this 
equivalence may not be very accurate, so the second step is to compute ei = e(A) via 
FISTA, using a very high accuracy of ^ = 10""'^^. The pair (A, ei) now leads to nearly 
equivalent solutions of (QP^) and (BPg). The solution from FISTA will also be used 
to judge the accuracy of the other algorithms. 

The other main difficulty in comparisons is a fair stopping criterion. Each algo- 
rithm has its own stopping criterion (or may offer a choice of stopping criteria), and 
these are not directly comparable. To overcome this difficulty, we have modified the 
codes of the algorithms to allow for two new stopping criterion that we feel are the 
only fair choices. The short story is that we use NESTA to compute a solution xm 
and then ask the other algorithms to compute a solution that is at least as accurate. 

Specifically, given NESTA's solution xn (using continuation), the other algo- 
rithms terminate at iteration k when the solution Xk satisfies 

(Grit. 1) \\xk\W<\\xN\W and ||6 - < 1.05 ||6 - Ax^Hf,, (5.1) 

or 

(Grit. 2) \\\xk\V, + \\\Axk-h\\%<\\\xN\W + \\\AxN -h\\X. (5.2) 

We run tests with both stopping criteria to reduce any potential bias from the fact 
that some algorithms solve (QP^;^), for which Grit. 2 is the most natural, while others 
solve (BPj), for which Grit. 1 is the most natural. In practice, the results when 
applying Grit. 1 or Grit. 2 are not significantly different. 

5.4. Numerical results. 

5.4.1. The case of exactly sparse signals. This first series of experiments 
tests all the algorithms discussed above in the case where the unknown signal is s- 
sparse with s — m/5, m = n/8, and n — 262,144. This situation is close to the 
limit of perfect recovery from noiseless data. The s nonzero entries of the signals 
are generated as described in (3.14). Reconstruction is performed with several values 
of the dynamic range d = 20,40,60,80,100 in dB. The measurement operator is a 
randomly subsampled discrete cosine transform, as in Section 4.1 (with a different 
random set of measurements chosen for each trial). The noise level is set to cr = 0.1. 
The results are reported in Tables 5.1 (Grit. 1) and 5.2 (Grit. 2); each cell in these 
table contains the mean value of Ma (the number of calls of A or A*) over 10 random 
trials, and, in smaller font, the minimum and maximum value of Ma over the 10 
trials. When convergence is not reached after Ma — 20,000, we report DNG (did not 
converge). As expected, the number of calls needed to reach convergence varies a lot 
from an algorithm to another. 

The careful reader will notice that Tables 5.1 and 5.2 do not feature the results 
provided by 11 Js; indeed, while it seems faster than other interior point methods, it 
is still far from being comparable to the other algorithms reviewed here. In these 
experiments ll_ls typically needed 1500 calls to A or A* for reconstructing a 20 dB 
signal with s = m/100 nonzero entries. For solving the same problem with a dynamic 
range of 100 dB, it took 5 hours to converge on a dual core MacPro G5 clocked at 
2.7GHz. 

GPSR performs well in the case of low-dynamic range signals; its performance, 
however, decreases dramatically as the dynamic range increases; Table 5.2 shows that 
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Table 5.1 

Number of function calls Ma averaged over 10 independent runs. The sparsity level s = m/5 
and the stopping rule is Crit. 1 (5.1). 



Method 


20 dB 


40 dB 


60 dB 


80 dB 


100 dB 


NESTA 


446 351/491 


880 719/951 


1701 1581/1777 


4528 4031/4749 


1464 7 7729/15991 


NESTA + Ct 


479 475/485 


551 539/559 


605 589/619 


658 635/679 


685 657/705 


GPSR 


56 44/62 


733 680/788 


5320 4818/5628 


DNC 


DNC 


GPSR + Ct 


305 293/311 


251 245/257 


497 453/531 


1816 1303/2069 


9101 7221/10761 


SpaRSA 


345 327/373 


455 435/469 


542 511/579 


601 563/629 


708 667/819 


SPGLl 


54 37/61 


128 102/142 


209 190/216 


354 297/561 


465 380/562 


FISTA 


68 66/69 


270 261/279 


935 885/969 


3410 2961/3594 


13164 11961/13911 


FPC AS 


156 111/177 


236 157/263 


218 215/239 


351 247/457 


325 313/335 


FPC AS (CO) 


312 212/359 


475 301/538 


434 423/481 


64 1 470/812 


583 567/595 


FPC 


414 394/436 


417 408/422 


571 546/594 


945 852/1038 


3945 2018/4734 


FPC-BB 


148 140/152 


166 158/168 


219 208/250 


264 252/282 


520 320/800 


Bregman-BB 


211 203/225 


270 257/295 


364 355/393 


470 429/501 


572 521/657 



Table 5.2 

Number of function calls Ma averaged over 10 independent runs. The sparsity level s = m/b 
and the stopping rule is Crit. 2 (5.2). 



Method 


20 dB 


40 dB 


60 dB 


80 dB 


100 dB 


NESTA 


446 351/491 


880 719/951 


1701 1581/1777 


4528 4031/4749 


14647 7729/15991 


NESTA + Ct 


479 475/485 


551 539/559 


605 589/619 


658 635/679 


685 657/705 


GPSR 


59 44/64 


736 678/790 


5316 4814/5630 


DNC 


DNC 


GPSR + Ct 


305 293/311 


251 245/257 


511 467/543 


1837 1323/2091 


9127 7251/10789 


SpaRSA 


345 327/373 


455 435/469 


541 509/579 


600 561/629 


706 667/819 


SPGLl 


55 37/61 


138 113/152 


217 196/233 


358 300/576 


470 383/568 


FISTA 


65 63/66 


288 279/297 


932 882/966 


3407 2961/3591 


13160 11955/13908 


FPC AS 


176 169/183 


236 157/263 


218 215/239 


344 247/459 


330 319/339 


FPC AS (CG) 


357 343/371 


475 301/538 


434 423/481 


622 435/814 


588 573/599 


FPC 


416 398/438 


435 418/446 


577 558/600 


899 788/962 


3866 1938/4648 


FPC-BB 


149 140/154 


172 164/174 


217 208/254 


262 248/286 


512 308/790 


Bregman-BB 


211 203/225 


270 257/295 


364 355/393 


470 429/501 


572 521/657 



it does not converge for 80 and 100 dB signals. GPSR with continuation does worse 
on the low dynamic range signals (which is not surprising). It does much better than 
the regular GPSR version on the high dynamic range signals, though it is slower than 
NESTA with continuation by more than a factor of 10. SpaRSA performs well at 
low dynamic range, comparable to NESTA, and begins to outperform GSPR with 
continuation as the dynamic range increases, although it begins to underperform 
NESTA with continuation in this regime. SpaRSA takes over twice as many function 
calls on the 100 dB signal as on the 20 dB signal. 

SPGLl shows good performance with very sparse signals and low dynamic range. 
Although it has fewer iteration counts than NESTA, the performance decreases much 
more quickly than for NESTA as the dynamic range increases; SPGLl requires about 
9x more calls to A at 100 dB than at 20 dB, whereas NESTA with continuation 
requires only about 1.5x more calls. FISTA is almost as fast as SPGLl on the low 
dynamic range signal, but degrades very quickly as the dynamic range increases, 
taking about 200 x more iterations at 100 dB than at 20 dB. One large contributing 
factor to this poor performance at high dynamic range is the lack of a continuation 
scheme. 

FPC performs well at low dynamic range, but is very slow on 100 dB signals. The 
Barzilai-Borwein version was consistently faster than the regular version, but also de- 
grades much faster than NESTA with continuation as the dynamic range increases. 
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Figure 5.1. Sorted wavelet coefficients of the natural image used in the experiments. 

Both FPC Active Set and the Bregman algorithm perform well at all dynamic ranges, 
but again, degrade faster than NESTA with continuation as the dynamic range in- 
creases. There is a slight difference between the two FPC Active set versions (using 
L-BFGS or CG), but the dependence on the dynamic range is roughly similar. 

The performances of NESTA with continuation are reasonable when the dynamic 
range is low. When the dynamic range increases, continuation helps by dividing the 
number of calls up to a factor about 20, as in the 100 dB case. In these experiments, 
the tolerance S is consistently equal to 10^^; while this choice is reasonable when 
the dynamic range is high, it seems too conservative in the low dynamic range case. 
Setting a lower value of S should improve NESTA's performance in this regime. In 
other words, NESTA with continuation might be tweaked to run faster on the low 
dynamic range signals. However, this is not in the spirit of this paper and this is why 
we have not researched further refinements. 

In summary, for exactly sparse signals exhibiting a significant dynamic range, 
1) the performance of NESTA with continuation — but otherwise applied out-of-the- 
box — is comparable to that of state-of-the-art algorithms, and 2) most state-of-the-art 
algorithms are efficient on these types of signals. 

5.4.2. Approximately sparse signals. We now turn our attention to approx- 
imately sparse signals. Such signals are generated via a permutation of the Haar 
wavelet coefficients of a 512 x 512 natural image. The data b are m — n/8 — 32,768 
discrete cosine measurements selected at random. White Gaussian noise with stan- 
dard deviation a — 0.1 is then added. Each test is repeated 5 times, using a different 
random permutation every time (as well as a new instance of the noise vector). Unlike 
in the exactly sparse case, the wavelet coefficients of natural images mostly contain 
mid-range and low level coefficients (see Figure 5.1) which are challenging to recover. 

The results are reported in Tables 5.3 (Grit. 1) and 5.4 (Grit. 2); the results from 
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Table 5.3 

Recovery results of an approximately sparse signal with Crit. I as a stopping rule. 



Method 


< > 


minA/A 


max A/a 


NESTA 


18912 


18773 


19115 


NESTA + Ct 


2667 


2603 


2713 


GPSR 


DNC 


DNC 


DNC 


GPSR + Ct 


DNC 


DNC 


DNC 


SpaRSA 


10019 


8369 


12409 


SPGLl 


1776 


1073 


2464 


FISTA 


10765 


10239 


11019 


FPC Active Set 


DNC 


DNC 


DNC 


FPC Active Set (CG) 


DNC 


DNC 


DNC 


FPC 


DNC 


DNC 


DNC 


FPC-BB 


DNC 


DNC 


DNC 


Bregman-BB 


2045 


2045 


2045 



applying the two stopping criteria are nearly identical. In these series of experiments, 
the performance of SPGLl is quite good but seems to vary a lot from one trial to 
another (Table 5.4). Notice that the concept of an active-set is ill defined in the 
approximately sparse case; as a consequence, the active set version of FPC is not 
much of an improvement over the regular FPC version. FPC is very fast for s-sparse 
signals but lacks the robustness to deal with less ideal situations in which the unknown 
is only approximately sparse. 

FISTA and SpaRSA converge for these tests, but are not competitive with the best 
methods. It is reasonable to assume that FISTA would also improve if implemented 
with continuation. SpaRSA already uses continuation but does not match its excellent 
performance on exactly sparse signals. 

Bregman, SPGLl, and NESTA with continuation all have excellent performances 
(continuation really helps NESTA) in this series of experiments. NESTA with con- 
tinuation seems very robust when high accuracy is required. The main distinguishing 
feature of NESTA is that it is less sensitive to dynamic range; this means that as 
the dynamic range increases, or as the noise level a decreases, NESTA becomes very 
competitive. For example, when the same test was repeated with more noise (ct = 1), 
all the algorithms converged faster. In moving from ct = 1 to ct = 0.1, SPGLl required 
90% more iterations and Bregman required 20% more iterations, while NESTA with 
continuation required only 5% more iterations. 

One conclusion from these tests is that SPGLl, Bregman and NESTA (with con- 
tinuation) are the only methods dealing with approximately sparse signals effectively. 
The other methods, most of which did very well on exactly sparse signals, take over 
10,000 function calls or even do not converge in 20,000 function calls; by comparison, 
SPGLl, Bregman and NESTA with continuation converge in about 2,000 function 
calls. It is also worth noting that Bregman is only as good as the subproblem solver; 
though not reported here, using the regular FPC (instead of FPC-BB) with Bregman 
leads to much worse performance. 

The algorithms which did converge all achieved a mean relative ii error (using 
(4.1) and the high accuracy FISTA solution as the reference) less than 2 • 10~^ and 
sometimes as low as 10~^, except SPGLl, which had a mean relative error of 1.1- 10^"^. 
Of the algorithms that did not converge in 20,000 function calls, FPC and FPC-BB 
had a mean £1 relative error about 5 • 10""^, GPSR with continuation had errors about 
5 • 10~^, and the rest had errors greater than 10^^. 
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Table 5.4 

Recovery results of an approximately sparse signal with Crit. 2 as a stopping rule. 



Method 


< > 


rain J\f A 


max A/a 


NESTA 


18912 


18773 


19115 


NESTA + Ct 


2667 


2603 


2713 


GPSR 


DNC 


DNC 


DNC 


GPSR + Ct 


DNC 


DNC 


DNC 


SpaRSA 


10021 


8353 


12439 


SPGLl 


1776 


1073 


2464 


FISTA 


10724 


10197 


10980 


FPC Active Set 


DNC 


DNC 


DNC 


FPC Active Set (CG) 


DNC 


DNC 


DNC 


FPC 


DNC 


DNC 


DNC 


FPC-BB 


DNC 


DNC 


DNC 


Bregman-BB 


2045 


2045 


2045 



6. An all-purpose algorithm. A distinguishing feature is that NESTA is able 
to cope with a wide range of standard regularizing functions. In this section, we 
present two examples: nonstandard minimization and total- variation minimization. 

6.1. Nonstandard sparse reconstruction: £1 analysis. Suppose we have a 
signal X € M", which is assumed to be approximately sparse in a transformed do- 
main such as the wavelet, the curvelet or the time-frequency domains. Let W be the 
corresponding synthesis operator whose columns are the waveforms we use to synthe- 
size the signal x = Wa (real- world signals do not admit an exactly sparse expansion); 
e.g. the columns may be wavelets, curvelets and so on, or both. We will refer to W* as 
the analysis operator. As before, we have (possibly noisy) measurements b = Ax'^ + z. 
The synthesis approach attempts reconstruction by solving 

minimize II^IUi (f. 1 \ 

subject to \\b- AWaWe^ < e, ^ ' ' 



while the analysis approach solves the related problem 

(6.2) 



minimize ||Ty*a;||£j 
subject to ||6 — Ax\\e2 — ^■ 

If W is orthonormal, the two problems are equivalent, but in general, these give 
distinct solutions and current theory explaining the differences is still in its infancy. 
The article [29] suggests that synthesis may be overly sensitive, and argues with 
geometric heuristics and numerical simulations that analysis is sometimes preferable. 

Solving ^i-analysis problems with NESTA is straightforward as only Step 1 needs 
to be adapted. We have 

Uix) = m^^{u,W*x)-'^\\u\\l, 
and the gradient at x is equal to 

VUix) = Wu^ix); 

here, u^{x) is given by 



{u^{x))[i] = 



\ii~\w*x)% i^\{w*x)\l]\<^Ji, 

I sgn((W^*a;)[i]), otherwise. 
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Steps 2 and 3 remain unchanged. The computational complexity of the algorithm 
is then increased by an extra term, namely 2Cw where Cw is the cost of applying W 
or W* to a vector. In practical situations, there is often a fast algorithm for applying 
W and W*, e.g. a fast wavelet transform [39], a fast curvelet transform [11], a fast 
short-time Fourier transform [39] and so on, which makes this a low-cost extra step^. 

6.2. Numerical results for nonstandard £i minimization. Because NESTA 
is one of very few algorithms that can solve both the analysis and synthesis problems 
efficiently, we tested the performance of both analysis and synthesis on a simulated 
real- world signal from the field of radar detection. The test input is a superposition 
of three signals. The first signal, which is intended to make recovery more difficult 
for any smaller signals, is a plain sinusoid with amplitude of 1000 and frequency near 
835 MHz. 

A second signal, similar to a Doppler pulse radar, is at a carrier frequency of 2.33 
GHz with maximum amplitude of 10, a pulse width of 1 fis and a pulse repetition 
interval of 10 /is; the pulse envelope is trapezoidal, with a 10 ns rise time and 40 ns fall 
time. This signal is more than 40 dB lower than the pure sinusoid, since the maximum 
amplitude is 100 x smaller, and since the radar is nonzero only 10% of the time. The 
Doppler pulse was chosen to be roughly similar to a realistic weather Doppler radar. 
In practice, these systems operate at 5 cm or 10 cm wavelengths (i.e. 6 or 3 GHz) and 
send out short trapezoidal pulses to measure the radial velocity of water droplets in 
the atmosphere using the Doppler effect. 

The third signal, which is the signal of interest, is a frequency-hopping radar 
pulse with maximum amplitude of 1 (so about 20 dB beneath the Doppler signal, and 
more than 60 dB below the sinusoid). For each instance of the pulse, the frequency is 
chosen uniformly at random from the range 200 MHz to 2.4 GHz. The pulse duration 
is 2 fis and the pulse repetition interval is 22 /is, which means that some, but not all, 
pulses overlap with the Doppler radar pulses. The rise time and fall time of the pulse 
envelope are comparable to the Doppler pulse. Frequency-hopping signals may arise 
in applications because they can be more robust to interference and because they can 
be harder to intercept. When the carrier frequencies are not known to the listener, 
the receiver must be designed to cover the entire range of possible frequencies (2.2 
GHz in our case). While some current analog-to-digital converters (ADC) may be 
capable of operating at 2.2 GHz, they do so at the expense of low precision. Hence 
this situation may be particularly amenable to a compressed sensing setup by using 
several slower (but accurate) ADC to cover a large bandwidth. 

We consider the exact signal to be the result of an infinite-precision ADC oper- 
ating at 5 GHz, which corresponds to the Nyquist rate for signals with 2.5 GHz of 
bandwidth. Measurements are taken using an orthogonal Hadamard transform with 
randomly permuted columns, and these measurements were subsequently sub-sampled 
by randomly choosing m = .3n rows of the transform (so that we undersample Nyquist 
by 10/3). Samples are recorded for T = 209.7/iS, which corresponds to n = 2^'^. White 
noise was added to the measurements to make a 60 dB signal-to-noise ratio (SNR) 
(note that the effective SNR for the frequency- hopping pulse is much lower). The 
frequencies of the sinusoid and the Doppler radar were chosen such that they were 
not integer multiples of the lowest recoverable frequency fmin — 1/(27") • 

For reconstruction, the signal is analyzed with a tight frame of Gabor atoms 



•^The ability to solve the analysis problem also means that NESTA can easily solve reweighted 
£l problems [16] with no change to the code. 
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that is approximately 5.5 x overcomplete. The particular parameters of the frame are 
chosen to give reasonable reconstruction, but were not tweaked excessively. It is likely 
that differences in performance between analysis and synthesis are heavily dependent 
on the particular dictionary. 

To analyze performance, we restrict our attention to the frequency domain in 
order to simplify comparisons. The top plot in Figure 6.1 shows the frequency com- 
ponents of the original, noiseless signal. The frequency hopping pulse barely shows 
up since the amplitude is 1000 x smaller than the sinusoid and since each frequency 
only occurs for 1 /is (of 210 /iS total). 

The bottom plots in Figure 6.1 show the spectrum of the recovered signal using 
analysis and synthesis, respectively. For this test, analysis does a better job at finding 
the frequencies belonging to the small pulse, while synthesis does a better job recreat- 
ing the large pulse and the pure tone. The two reconstructions used slightly different 
values of fj, to account for the redundancy in the size of the dictionary; otherwise, 
algorithm parameters were the same. In the analysis problem, NESTA took 231 calls 
to the analysis/synthesis operator (and 231 calls to the Hadamard transform); for 
synthesis, NESTA took 1378 calls to the analysis/synthesis operator (and 1378 to the 
Hadamard transform). With NESTA, synthesis is more computationally expensive 
than analysis since no change of variables trick can be done; in the synthesis case, W 
and W* are used in Step 2 and 3 while in the analysis case, the same operators are used 
once in Step 1 (this is accomplished by the previously mentioned change-of-variables 
for partial orthogonal measurements). 

As emphasized in [29], when W is overcomplete, the solution computed by solving 
the analysis problems is likely to be denser than in the synthesis case. In plain 
English, the analysis solution may seem "noisier" than the synthesis solution. But 
the compactness of the solution of the synthesis problem may also be its weakness: 
an error on one entry of a may lead to a solution that differs a lot. This may explain 
why the frequency-hopping radar pulse is harder to recover with the synthesis prior. 

Because all other known first-order methods solve only the synthesis problem, 
NESTA may prove to be extremely useful for real-world applications. Indeed, this 
simple test suggests that analysis may sometimes be much preferable to synthesis, 
and given a signal with 2^° samples (too large for interior point methods), we know 
of no other algorithm that can return the same results. 

6.3. Total-variation minimization. Nesterov's framework also makes total- 
variation minimization possible. The TV norm of a 2D digital object j] is given 
by 

where Di and D2 are the horizontal and vertical differences 

{Dix)[i,j] = x[i + l,j] - x[i,j], 
{D2x)[i,j] = x[i,j + 1]- x[i,j]. 

Now the TV norm can be expressed as follows: 

||x||tv = max(u,_Dx), (6.3) 

where u — [ui,U2]* G Q,d if and only for each (i,j), + 1*2 < 1, and 

D = [_Di,_D2]*- The key feature of Nesterov's work is to smooth a well-structured 



30 



Spectrum estimate of the Nyquist-sampled signal 

Spectrum 
- - Exact frequency of signaM 

Exact frequency of Doppler Radar 

Exact frequencies of frequency hopping pulse 




1 1.5 
Frequency (Hz) 
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Figure 6.1. Top: spectrum estimate of the exact signal, no noise. The pure tone at 60 dB 
and the Doppler radar at 20 dB dominate the dB frequency hopping pulses. Middle: spectrum 
estimate of the recovered signal using analysis prior, with 60 dB SNR. The spectrum appears noisy, 
but the frequency hopping pulses stand out. Bottom: spectrum estimate of the recovered signal using 
synthesis prior, with 60 dB SNR. The spectrum appears cleaner, but the small dB pulses do not 
appear. 
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nonsmooth function as follows (notice in (6.3) the similarity between the TV norm 
and the £i norm): 

max(M, Dx) — fj,pd{u). 

uQd 

Choosing Pd{u) = provides a reasonable prox-function that eases the compu- 

tation of V/^. Just as before, changing the regularizing function only modifies Step 
1 of NESTA. Here, 



/^(x) = max(M,£ia;) - ^\\u\\l^. 
ueQd z 



Then as usual, 



Vf^ix) = D*u^{x), 
where Up(x) is of the form [ui,M2]* and for each a £ {1,2}, 



/i ^{DaX)[i,jl if \\^x[i,i]\\ < /X, 

1 1 Vx[i , j] 1 1 ^ (£)a2;) [«, j] , otherwise. 



The application of D and D* leads to a negligible computational cost (sparse matrix- 
vector multiplications). 

6.4. Numerical results for TV minimization. We are interested in solving 

minimize \\x\\tv ,r..^ 
subject to — — ^- 

To be sure, a number of efficient TV-minimization algorithms have been proposed 
to solve (6.4) in the special case A ^ I (denoising problem), see [17, 22, 33]. In 
comparison, only a few methods have been proposed to solve the more general problem 
(6.4) even when A is a projector. Known methods include interior point methods (£i- 
magic) [10], proximal-subgradient methods [5, 19], Split-Bregman [33], and the very 
recently introduced RecPF'* [55] , which operates in the special case of partial Fourier 
measurements. Roughly, proximal gradient methods approach the solution to (6.4) 
by iteratively updating the current estimate x^ as follows: 

Xfe+i = ProxTK7 i^k + aA*{b ~ Axk)) , 

where Prox^y,-! is the proximity operator of TV, see [20] and references therein, 

VroyiTVni^) = argmin7||x||TV + ;^||a; - z\\%- 

Evaluating the proximity operator at z is equivalent to solving a TV denoising prob- 
lem. In [5], the authors advocate the use of a side algorithm (for instance ChamboUe's 
algorithm [17]) to evaluate the proximity operator. There are a few issues with this 
approach. The first is that side algorithms depend on various parameters, and it is 
unclear how one should select them in a robust fashion. The second is that these algo- 
rithms are computationally demanding which makes them hard to apply to large-scale 
problems. 



available at http://www.caajii.rice.edu/~optiniization/Ll/RecPF/. 
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To be as fair as possible, we decided to compare NESTA with algorithms for which 
a code has been publicly released; this is the case for the newest in the family, namely, 
RecPF (as ^i-magic is based on an interior point method, it is hardly applicable to 
this large-scale problem). Hence, we propose comparing NESTA for TV minimization 
with RecPF. 

Evaluations are made by comparing the performances of NESTA (with continu- 
ation) and RecPF on a set of images composed of random squares. As in Section 5, 
the dynamic range of the signals (amplitude of the squares) varies in a range from 20 
to 40 dB. The size of each image x is 1024 x 1024; one of these images is displayed in 
the top panel of Figure 6.2. The data b are partial Fourier measurements as in [13]; 
the number of measurements m — n/10. White Gaussian noise of standard deviation 
(T = 0.1 is added. The parameters of NESTA are set up as follows: 

XQ^A*b, fi^ 0.2, 5=10-^, T^5, 

and the initial value of /i is 

/io = 0.9 max ||Va;o[«, 

The maximal number of iterations is set to Imax = 4,000. As it turns out, TV 
minimization from partial Fourier measurements is of significant interest in the field 
of Magnetic Resonance Imaging [38]. 

As discussed above, RecPF has been designed to solve TV minimization recon- 
struction problems from partial Fourier measurements. We set the parameters of 
RecPF to their default values except for the parameter tol_rel_inn that is set to 
10~^. This choice makes sure that this converges to a solution close enough to 
NESTA 's output. Figure 6.2 shows the the solution computed by RecPF (bottom 
left) and NESTA (bottom right). 

The curves in Figure 6.3 show the number of calls to A or A*; mid-points are 
averages over 5 random trials, with error bars indicating the minimum and maximum 
number of calls. Here, RecPF is stopped when 

||a;RccPF||Ty < l-05||a;7v||Ty, 
\\b ~ Axu,ccPF\\e2 < 1-05||6- AxTvllfa, 

where xn is the solution computed via NESTA. As before continuation is very efficient 
when the dynamic range is high (typically higher than 40 dB). An interesting feature is 
that the numbers of calls are very similar over all five trials. When the dynamic range 
increases, the computational costs of both NESTA and RecPF naturally increase. 
Note that in the 60 and 80 dB experiments, RecPF did not converge to the solution 
and this is the reason why the number of calls saturates. While both methods have 
a similar computational cost in the low-dynamic range regime, NESTA has a clear 
advantage in the higher-dynamic range regime. Moreover, the number of iterations 
needed to reach convergence with NESTA with continuation is fairly low — 300-400 
calls to A and A* — and so this algorithm is well suited to large scale problems. 

7. Discussion. In this paper, we have proposed an algorithm for general sparse 
recovery problems, which is based on Nesterov's method. This algorithm is accurate 
and competitive with state-of-the-art alternatives. In fact, in applications of greatest 
interest such as the recovery of approximately sparse signals, it outperforms most 
of the existing methods we have used in our comparisons and is comparable to the 
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Figure 6.2. Top: original image of size 1024x 1024 with a dynamic range of about 40 dB. 
Bottom-Left: RecPF solution. Bottom-Right: NESTA solution. 
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Figure 6.3. Number of calls to A and A* as a function of the dynamic range. Solid line: 
NESTA with continuation. Dashed line: NESTA. Dotted line: RecPF. Dash-dotted: maximum 
number of iterations. In the 60 and 80 dB experiments, RecPF did not converge to the solution and 
this is the reason why the number of calls saturates. 
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best. Further, what is interesting here, is that we have not attempted to optimize 
the algorithm in any way. For instance, we have not optimized the parameters {a^} 
and {rfe}, or the number of continuation steps as a function of the desired accuracy 
i5, and so it is expected that finer tuning would speed up the algorithm. Another 
advantage is that NESTA is extremely flexible in the sense that minor adaptations 
lead to efficient algorithms for a host of optimization problems that are crucial in the 
field of signal/image processing. 

7.1. Extensions. This paper focused on the situation in which A* A is a pro- 
jector (the rows of A are orthonormal) . This stems from the facts that 1) the most 
computationally friendly compressed sensing are of this form, and 2) it allows fast 
computations of the two sequence of iterates {yk} and {zk}. It is important, however, 
to extend NESTA as to be able to cope with a wider range of problem in which A* A 
is not a projection (or not diagonal). 

In order to do this, observe that in Steps 2 and 3, we need to solve problems of 
the form 

yk = argmin||a; - q\\l^, 

xeQp 

for some q, and we have seen that the solution is given by yk — Vq^^q), where Vq^ is 
the projector onto Qp := {a; : \\Ax — b\\g^ < e}. The solution is given by 

yk^{I + \A*A)-\q+XA*b) (7.1) 

for some A > 0. When the eigenvalues of A* A are well clustered, the right-hand side 
of (7.1) can be computed very efficiently via a few conjugate gradients (CG) steps. 
Note that this is of direct interest in compressed sensing applications in which A is 
a random matrix since in all the cases we are familiar with, the eigenvalues of A* A 
are tightly clustered. Hence, NESTA may be extended to general problems while 
retaining its efficiency, with the proviso that a good rule for selecting A in (7.1) is 
available; i.e. such that \\Ayk — b\\£^ — e unless q e Qp. Of course, one can always 
eliminate the problem of finding such a A by solving the unconstrained problem (QP;^) 
instead of (BP^). In this case, each NESTA iteration is actually very cheap, no matter 
how A looks like. 

Finally, we also observe that Nesterov's framework is likely to provide efficient 
algorithms for related problems, which do not have the special £i + £2 structure. One 
example might be the Dantzig selector, which is a convenient and flexible estimator 
for recovering sparse signals from noisy data [15]: 

minimize \\x\\i^ , , 

subjectto \\A*{b- AX)\\e^ <S. ^ ' 

This is of course equivalent to the unconstrained problem 

minimize Aj|a;||£j + j| A*(6 — AX)||£^ 

for some value of A. Clearly, one could apply Nesterov's smoothing techniques to 
smooth both terms in the objective functional together with Nesterov's accelerated 
gradient techniques, and derive a novel and efficient algorithm for computing the 
solution to the Dantzig selector. This is an example among many others. Another 
might be the minimization of a sum of two norms, e.g. an £1 and a TV norm, under 
data constraints. 
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7.2. Software. In the spirit of reproducible research [26], a Matlab version of 
NESTA will be made available at: http://www.acm.caltecli.edu/~nesta/ 
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